Ethereum Price Prediction

  • Import data
    • Show prices, market capitalization and volume
    • Analysis
  • Preprocessing data
    • Complete the Index
    • Find NaNs and Fix Them
    • Closing Price Column
  • Split data into training and test datasets
    • Normalizing datasets
  • Building the LSTM model
    • Regressor
    • Sequence
      • Special Normalizations for Sequences
        • Custom: window steps by change rate
  • Testing the model
  • Conclusions

Import Libraries


In [1]:
import os
import io
import math
import random
import requests
from tqdm import tqdm
import numpy as np
import pandas as pd
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

import matplotlib.pyplot as plt
%matplotlib inline

Import Data and Analysis


In [2]:
def download_file(url, filename):
    """Download a file in chunks, showing a progress bar."""
    r = requests.get(url, stream=True)

    total_size = int(r.headers.get('content-length', 0))
    block_size = 1024
    total_kb_size = math.ceil(total_size / block_size)

    wrote = 0
    with open(filename, 'wb') as f:
        for data in tqdm(r.iter_content(block_size), total=total_kb_size, unit='KB', unit_scale=True):
            wrote = wrote + len(data)
            f.write(data)

In [3]:
datafile = "eth-eur.csv"

#download from the server if the file is not already present
if not os.path.exists(datafile):
    download_file("https://www.coingecko.com/price_charts/export/279/eur.csv", datafile)


85.0KB [00:00, 531KB/s]

In [4]:
data = pd.read_csv(datafile)

#print a random sample (randint's upper bound is inclusive, so subtract 1)
data.iloc[random.randint(0, data.shape[0] - 1)]


Out[4]:
snapped_at      2017-11-25 00:00:00 UTC
price                           391.764
market_cap                  3.75913e+10
total_volume                8.90435e+08
Name: 840, dtype: object

In [5]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1087 entries, 0 to 1086
Data columns (total 4 columns):
snapped_at      1087 non-null object
price           1087 non-null float64
market_cap      1086 non-null float64
total_volume    1087 non-null float64
dtypes: float64(3), object(1)
memory usage: 34.0+ KB

Here we can see that every sample is defined by a date, the price on that day, the market capitalization and the total volume of transactions made that day.

At first glance they all look like useful indicators, so all of them will be used as features.
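
A quick optional sanity check (not part of the original notebook) is to look at how each column correlates with the next day's price, which is the value we will eventually try to predict:

#Hypothetical check: correlation of each column with the next day's price
cols = ['price', 'market_cap', 'total_volume']
data[cols].corrwith(data.price.shift(-1))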


In [6]:
#customize the index: keep only the date part of the timestamp
#e.g. '2015-08-07 00:00:00 UTC' -> '2015-08-07'
data.snapped_at = data.snapped_at.apply(lambda x: x.split()[0])

In [7]:
data.set_index('snapped_at', inplace=True)
data.index = pd.to_datetime(data.index)

In [8]:
features = ['price', 'market_cap', 'total_volume']

In [9]:
data[features].plot(subplots=True, layout=(1,3), figsize=(20,4));



In [10]:
data.iloc[0:10]


Out[10]:
price market_cap total_volume
snapped_at
2015-08-07 2.580213 0.000000e+00 8.257608e+04
2015-08-08 1.175306 7.095505e+07 3.250759e+05
2015-08-10 0.624116 3.772033e+07 3.634980e+05
2015-08-11 0.966607 5.844581e+07 1.375588e+06
2015-08-12 1.126292 6.813006e+07 1.858814e+06
2015-08-13 1.636673 9.904778e+07 3.927292e+06
2015-08-14 1.643557 9.951063e+07 3.920484e+06
2015-08-15 1.505036 9.116528e+07 2.269451e+06
2015-08-16 1.329391 8.055977e+07 2.730304e+06
2015-08-17 1.086774 7.882067e+07 1.697221e+06

Preprocessing Data

Complete the Index

The date index is not complete (for example, 2015-08-09 is missing), so we have to fill in the gaps.
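
As a side note, a quick way to list every missing day (not part of the original notebook) is to compare a full pd.date_range against the existing index:

#Hypothetical check: list all the dates missing from the index
full_range = pd.date_range(data.index.min(), data.index.max())
missing_days = full_range.difference(data.index)
print(missing_days)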


In [11]:
#check
'2015-08-09 00:00:00' in data.index


Out[11]:
False

In [12]:
#Generate all the possible days and use them to reindex
start = data.index.min()
end = data.index.max()

index_complete = pd.date_range(start, end)
data = data.reindex(index_complete)

Now the index is complete, but the newly created rows contain NaNs and must be filled in.

Find NaNs and Fix Them


In [13]:
#Fill each blank with the mean of the previous day and the day after

for idx in data.index:
    dayloc = data.index.get_loc(idx)
    day = data.loc[idx]
    if day.hasnans:
        #slice covering the previous day, the NaN day and the next one;
        #mean() skips NaNs, so this averages the two neighbours
        rg = slice(dayloc-1, dayloc+2)
        data.loc[idx] = data.iloc[rg].mean()
        
        print("Day <{}> updated".format(idx))


Day <2015-08-09 00:00:00> updated
Day <2017-04-02 00:00:00> updated

In [14]:
#check
data.loc['2015-08-09 00:00:00']


Out[14]:
price           8.997108e-01
market_cap      5.433769e+07
total_volume    3.442869e+05
Name: 2015-08-09 00:00:00, dtype: float64

In [15]:
#Check whether any NaNs remain elsewhere
data[data.isnull().any(axis=1)].count()


Out[15]:
price           0
market_cap      0
total_volume    0
dtype: int64

Closing Price Column

Now we need to add a new feature holding the closing price of every sample. The Ethereum market never closes, so we can ignore weekends and simply take the next day's price as the closing price.

The model will later use this feature as the target, since it is the value we are trying to predict.

The following script builds the column.


In [16]:
new_column = 'closed_price'
datab = data.copy()

nc = list()

for idx in data.index:
    dayloc = data.index.get_loc(idx)
    
    #the price of the following day becomes today's closing price
    if dayloc == len(data.index)-1:
        #last position will not have closed_price
        closed_price = np.nan
    else:
        closed_price = data.iloc[dayloc+1].price
    
    nc.append(closed_price)

data[new_column] = nc
data.tail(5)


Out[16]:
price market_cap total_volume closed_price
2018-07-25 408.412874 4.120839e+10 1.507674e+09 399.352156
2018-07-26 399.352156 4.030245e+10 1.335759e+09 395.804723
2018-07-27 395.804723 3.995254e+10 1.224490e+09 400.906575
2018-07-28 400.906575 4.047571e+10 1.129973e+09 399.159567
2018-07-29 399.159567 4.030749e+10 1.361944e+10 NaN

In [17]:
#Drop the last row because its closing price is still unknown
data = data.drop(data.index[len(data)-1])
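
As an aside (not the notebook's code), the same column could be built without the loop by using pandas shift, which moves the price series one step back; it is shown here for reference only, since the column has already been created above:

#Hypothetical vectorized equivalent of the loop in In [16] (do not run again here)
#data['closed_price'] = data.price.shift(-1)
#data = data.drop(data.index[-1])   #the last row has no known closing price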

Split Data into Training and Test Datasets


In [18]:
#X_train, X_test, y_train, y_test = train_test_split(data[features], 
#                                                    data.closed_price, 
#                                                    test_size=0.20,
#                                                    shuffle=False,
#                                                    random_state=42)

#90% for training, 10% for testing
split = round(len(data)*0.9)
data_train, data_test = data[:split].copy(), data[split:].copy()

In [19]:
print("Size data_train: {}".format(data_train.shape[0]))
print("Size data_test: {}".format(data_test.shape[0]))


Size data_train: 978
Size data_test: 109

Normalizing Datasets

Care is needed here because we don't know whether future values will fall inside the range seen so far. For this reason we fit the scaler using only the training data and NOT the test data.

Standardization is a well-known normalization technique that centers each feature and rescales it by its standard deviation; it is fairly robust to new values that fall outside the expected range.

*Note: this method assumes the data roughly follows a Gaussian distribution.
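
Concretely, standardization maps every feature to z = (x - mean) / std, where the mean and standard deviation are computed on the training set only. A minimal sketch of what the scaler in the next cell does under the hood (the variable names here are purely illustrative):

#Hypothetical manual equivalent of StandardScaler, using training statistics only
train_mean = data_train[data.columns].mean()
train_std = data_train[data.columns].std(ddof=0)   #StandardScaler uses the population std

manual_train_norm = (data_train[data.columns] - train_mean) / train_std
manual_test_norm = (data_test[data.columns] - train_mean) / train_std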


In [20]:
#Scale the data
scaler = StandardScaler()

data_train_norm, data_test_norm = data_train.copy(), data_test.copy()

data_train_norm[data.columns] = scaler.fit_transform(data_train[data.columns])
data_test_norm[data.columns] = scaler.transform(data_test[data.columns])

data_test_norm.describe()


Out[20]:
price market_cap total_volume closed_price
count 109.000000 109.000000 109.000000 109.000000
mean 1.408576 1.473316 1.474397 1.409487
std 0.367760 0.369896 0.709833 0.364502
min 0.819844 0.853063 0.447957 0.868519
25% 1.100298 1.163725 1.037712 1.100283
50% 1.315260 1.385547 1.381100 1.313598
75% 1.660170 1.715426 1.778152 1.658455
max 2.288482 2.361295 5.217228 2.286671

Building the LSTM Model

Check TensorFlow and GPU


In [21]:
from distutils.version import LooseVersion
import warnings
import tensorflow as tf

# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found. Please use a GPU to train your neural network.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))


TensorFlow Version: 1.0.0
/Users/samuel/anaconda/envs/py3/lib/python3.5/site-packages/ipykernel_launcher.py:11: UserWarning: No GPU found. Please use a GPU to train your neural network.
  # This is added back by InteractiveShellApp.init_path()

Regressor model

  • 1 step and 3 features

In [22]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Flatten

X_train = data_train_norm[features].values.reshape((data_train_norm.shape[0], 1, 3))
y_train = data_train_norm.closed_price.values

X_test = data_test_norm[features].values.reshape((data_test_norm.shape[0], 1, 3))
y_test = data_test_norm.closed_price.values


Using TensorFlow backend.

In [23]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)


(978, 1, 3)
(978,)
(109, 1, 3)
(109,)

In [24]:
model = Sequential()
model.add(LSTM(32, input_shape=(1, 3) ))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)


Out[24]:
<keras.callbacks.History at 0x111088208>

In [25]:
print("Training) R^2 score: {:.3f}".format(r2_score(y_train, model.predict(X_train))))
print("Testing)  R^2 score: {:.3f}".format(r2_score(y_test, model.predict(X_test))))


Training) R^2 score: 0.993
Testing)  R^2 score: 0.809

In [26]:
pred = model.predict(X_train)
plt.plot(y_train, label='Actual')
plt.plot(pred, label='Prediction')
plt.legend()


Out[26]:
<matplotlib.legend.Legend at 0x11c99cf28>

In [27]:
#saving
model_1_3 = model

Sequence model

  • 7 steps and 3 features

In [28]:
'''
Helper function that turns a 2-D dataset into overlapping
sliding windows with shape (samples, steps, features)
'''
def prepare_sequence(data, sequence_size=7):
    sequence = []
    #skip the first leftover rows so the usable length is a multiple of sequence_size
    buckets = data.shape[0]//sequence_size
    init_sample = data.shape[0] - buckets*sequence_size
    samples = 0
    #each window starts one row after the previous one (overlapping windows)
    for i in range(init_sample, data.shape[0] - sequence_size + 1):
        sequence.append(data[i:i+sequence_size])
        samples += 1
    return np.concatenate(sequence).reshape((samples, sequence_size, data.shape[1]))

prepare_sequence(data[features]).shape


Out[28]:
(1079, 7, 3)
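
To make the windowing concrete, here is a tiny sketch with hypothetical toy data (6 rows, 2 features): consecutive rows become overlapping windows, so with sequence_size=3 we get 4 windows.

#Hypothetical toy example of the sliding windows
toy = pd.DataFrame(np.arange(12).reshape(6, 2), columns=['a', 'b'])
prepare_sequence(toy, sequence_size=3).shape   #-> (4, 3, 2): rows 0-2, 1-3, 2-4, 3-5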

In [29]:
#getting (samples, steps, features)
X_train = prepare_sequence(data_train_norm[features])
X_test = prepare_sequence(data_test_norm[features])

y_train = data_train_norm.iloc[-len(X_train):].closed_price.values
y_test = data_test_norm.iloc[-len(X_test):].closed_price.values

In [30]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)


(967, 7, 3)
(967,)
(99, 7, 3)
(99,)

In [31]:
model = Sequential()
model.add(LSTM(32, input_shape=(7, 3) ))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)


Out[31]:
<keras.callbacks.History at 0x11caa4f60>

In [32]:
print("Training) R^2 score: {:.3f}".format(r2_score(y_train, model.predict(X_train))))
print("Testing)  R^2 score: {:.3f}".format(r2_score(y_test, model.predict(X_test))))


Training) R^2 score: 0.993
Testing)  R^2 score: 0.897

In [33]:
pred = model.predict(X_train)
plt.plot(y_train, label='Actual')
plt.plot(pred, label='Prediction')
plt.legend()


Out[33]:
<matplotlib.legend.Legend at 0x11d274f60>

In [34]:
#saving
model_7_3 = model

⇩ SPECIAL NORMALIZATION FOR SEQUENCES ⇩

The neural network is not able to make good predictions for data it has never seen before; for that reason we can find days that are not well fitted. This problem is related to 'out-of-scale' inputs.

Custom: window steps by the rate of change

Thinking of the batch size as a window of days that defines how the neural network learns, one idea is to normalize each window by its last sample: every value in the window is divided by the absolute value of the window's last value, so a window such as [2, 4, 8] becomes [0.25, 0.5, 1.0]. This way we can keep almost all the data on the same scale.


In [35]:
def print_mean_std(data):
    mean = np.mean(data)
    std = np.std(data)
    print("mean:{:.3f} std:{:.3f}".format(mean, std))

In [36]:
def window_normalization(data, window_size):
    """Divide each window by the absolute value of its last element."""
    y = np.empty_like(data, dtype='float64')
    normalizer = list()
    for i in range(0, len(data), window_size):
        j = min(i+window_size, len(data))
        y[i:j] = data[i:j]/np.abs(data[j-1])
        normalizer.append(np.abs(data[j-1]))
        #print_mean_std(y[i:j])
        
    return y, normalizer

def window_denormalization(norm_data, normalizer, window_size):
    """Undo window_normalization using the stored per-window factors."""
    y = np.empty_like(norm_data, dtype='float64')
    idx = 0
    for i in range(0, len(norm_data), window_size):
        j = min(i+window_size, len(norm_data))
        y[i:j] = norm_data[i:j]*normalizer[idx]
        idx += 1
        
    return y

In [37]:
#testing the function
a = np.array([[1, 1, 1], [2, 2, 2], [2, 2, 2], [8, 8, 8]])
expected_result = np.array([[0.5, 0.5, 0.5], [1, 1, 1], [0.25, 0.25, 0.25], [1, 1, 1]])
norm_a, normalizer = window_normalization(a, 2)

assert ( np.array_equal(norm_a, expected_result) )
assert ( np.array_equal(a, window_denormalization(norm_a, normalizer, 2)) )

In [38]:
#Show the date of the last sample
data.index[-1].strftime("%d-%m-%Y")


Out[38]:
'28-07-2018'

In [39]:
window_size=32

X_train = data_train[features].values
y_train = data_train.closed_price.values

X_train_norm, _ = window_normalization(X_train, window_size)
y_train_norm, y_normalizer = window_normalization(y_train, window_size)

#getting (samples, steps, features)
X_train_norm = prepare_sequence(X_train_norm)
y_train_norm = y_train_norm[-len(X_train_norm):]

In [40]:
model = Sequential()
model.add(LSTM(32, input_shape=(7,3) ))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train_norm, y_train_norm, epochs=50, batch_size=window_size, verbose=0)


Out[40]:
<keras.callbacks.History at 0x11d50ae80>

In [41]:
X_test = data_test[features].values
y_test = data_test.closed_price.values

X_test_norm, _ = window_normalization(X_test, window_size)
y_test_norm, y_scaler = window_normalization(y_test, window_size)

#getting (samples, steps, features)
X_test_norm = prepare_sequence(X_test_norm)
y_test_norm = y_test_norm[-len(X_test_norm):]

In [42]:
print("Training) R^2 score: {:.3f}".format(r2_score(y_train_norm, model.predict(X_train_norm))))
print("Testing)  R^2 score: {:.3f}".format(r2_score(y_test_norm, model.predict(X_test_norm))))


Training) R^2 score: 0.921
Testing)  R^2 score: 0.830

In [43]:
pred = model.predict(X_train_norm)
plt.plot(y_train_norm, label='Actual')
plt.plot(pred, label='Prediction')
plt.legend()


Out[43]:
<matplotlib.legend.Legend at 0x11dc4fda0>

In [44]:
#saving 
model_win = model

Testing the Model

Judging by the results above, our best chance of accurate predictions (at a glance) is to use:

  • the LSTM fed with sequences of 7 steps and 3 features
  • data standardization

In [45]:
X_test = prepare_sequence(data_test_norm[features])
y_test = data_test_norm.iloc[-len(X_test):].closed_price.values

pred = model_7_3.predict(X_test)
plt.plot(y_test, label='Actual')
plt.plot(pred, label='Prediction')
plt.legend()


Out[45]:
<matplotlib.legend.Legend at 0x11dd97f98>
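
One last remark: these predictions are in standardized units. To read them back in EUR, the standardization of the closing price can be inverted with the scaler's stored statistics; a minimal sketch (the index lookup is just illustrative):

#Hypothetical: undo the standardization of the closing price predictions
cp_idx = list(data.columns).index('closed_price')
pred_eur = pred * scaler.scale_[cp_idx] + scaler.mean_[cp_idx]
actual_eur = y_test * scaler.scale_[cp_idx] + scaler.mean_[cp_idx]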